Fault-Tolerant Clusters of Workstations with Single System Image
Abstract
The computing trend is moving from clustering high-end mainframes to clustering desktop computers. This trend is triggered by the widespread use of PCs, workstations, Gigabit networks, and middleware support for clustering. This paper presents new approaches to achieving fault tolerance and single system image (SSI) in a workstation cluster. A multicomputer cluster is a collection of node computers that are physically connected by local area networks or by high-bandwidth switch networks using optical fibres. The workstations in the cluster can work collectively as an integrated computing resource, that is, as an SSI, or they can operate separately as individual computers.
Related resources
Fault tolerant system with imperfect coverage, reboot and server vacation
This study is concerned with the performance modeling of a fault tolerant system consisting of operating units supported by a combination of warm and cold spares. The on-line and warm standby units are subject to failure and are sent for repair to a repair facility with a single repairman, who is prone to failure. If the failed unit is not detected, the system enters into an unsafe...
Fault-Tolerant Matrix Operations for Networks of Workstations Using Diskless Checkpointing
Networks of workstations (NOWs) offer a cost-effective platform for high-performance, long-running parallel computations. However, these computations must be able to tolerate the changing and often faulty nature of NOW environments. We present high-performance implementations of several fault-tolerant algorithms for distributed scientific computing. The fault-tolerance is based on diskless chec...
Fault-tolerant Cluster Management for Reliable High-performance Computing
Clusters of COTS workstations/PCs are commonly used to implement cost-effective high-performance systems. A central coordinator/manager is often the simplest way to implement many of the operations required for managing these distributed systems. These operations include scheduling of parallel tasks, coordination of access to limited resources, as well as high-level coordination of fault tolera...
Development and Performance Analysis of a Fault Tolerant Algorithm for Cluster of Workstations
A Cluster of Workstations (COW) is a network-based multicomputer system and the most prominent distributed-memory system aimed at replacing supercomputers. A cluster of workstations can be viewed as a single machine in which one job is divided into n subtasks and delegated to n workstations in the COW architecture. To get the job completed, all subtasks assigned to component workstations mus...
DDG Task Recovery for Cluster Computing
This paper presents a solution to the problem of transparent recovery of asynchronous distributed computation on clusters of workstations when a fault occurs on a node. If the system has fault-tolerant features, it can survive the fault and continue its computations. Performance degradation is unavoidable when hardware redundancies are not available. It is a large advantage if the long-runtim...
Publication date: 1998